Abstract. The generalizability of empirical findings to new
environments, settings or populations, often called “external
validity,” is essential in most scientific explorations. This paper
treats a particular problem of generalizability, called
“transportability,” defined as a license to transfer causal effects
learned in experimental studies to a new population, in which only
observational studies can be conducted. We introduce a formal
representation called “selection diagrams” for expressing knowledge
about differences and commonalities between populations of interest
and, using this representation, we reduce questions of
transportability to symbolic derivations in the do-calculus. This
reduction yields graph-based procedures for deciding, prior to
observing any data, whether causal effects in the target population
can be inferred from experimental findings in the study population.
When the answer is affirmative, the procedures identify what
experimental and observational findings need be obtained from the two
populations, and how they can be combined to ensure bias-free
transport.

Key words and phrases: Experimental design, generalizability, causal 
effects, external validity. 


1. INTRODUCTION: THREATS VS. ASSUMPTIONS

Science is about generalization, and generalization requires that
conclusions obtained in the laboratory be transported and applied
elsewhere, in an environment that differs in many aspects from that of
the laboratory.

Clearly, if the target environment is arbitrary, or drastically
different from the study environment, nothing can be transferred and
scientific progress will come to a standstill. However, the fact that
most studies are conducted with the intention of applying the results
elsewhere means that we usually deem the target environment
sufficiently similar to the study environment to justify the transport
of experimental results or their ramifications.

Remarkably, the conditions that permit such transport have not
received systematic formal treatment. In statistical practice,
problems related to combining and generalizing from diverse studies
are handled by methods of meta-analysis (Glass (1976); Hedges and
Olkin (1985); Owen (2009)), or hierarchical models (Gelman and Hill
(2007)), in which results of diverse studies are pooled together by
standard statistical procedures (e.g., inverse-variance reweighting in
meta-analysis, partial pooling in hierarchical modeling). These
methods rarely make explicit distinction between experimental and
observational regimes, and their performance is evaluated primarily by
simulation.

To supplement these methodologies, our paper provides theoretical
guidance in the form of limits on what can be achieved in practice,
what problems are likely to be encountered when populations differ
significantly from each other, what population differences can be
circumvented by clever design and what differences constitute
theoretical impediments, prohibiting generalization by any means
whatsoever.

On the theoretical front, the standard literature on this topic,
falling under rubrics such as “external validity” (Campbell and
Stanley (1963), Manski (2007)), “heterogeneity” (Hofler, Gloster and
Hoyer (2010)), “quasi-experiments” (Shadish, Cook and Campbell (2002),
Chapter 3; Adelman (1991)), 1 consists primarily of “threats,” namely,
explanations of what may go wrong when we try to transport results
from one study to another while ignoring their differences. Rarely do
we find an analysis of “licensing assumptions,” namely, formal
conditions under which the transport of results across differing
environments or populations is licensed from first principles. 2

The reasons for this asymmetry are several. First, threats are safer
to cite than assumptions. He who cites “threats” appears prudent,
cautious and thoughtful, whereas he who seeks licensing assumptions
risks suspicions of attempting to endorse those assumptions.

Second, assumptions are self-destructive in their honesty. The more
explicit the assumption, the more criticism it invites, for it tends
to trigger a richer space of alternative scenarios in which the
assumption may fail. Researchers prefer therefore to declare threats
in public and make assumptions in private.

Third, whereas threats can be communicated in 
plain English, supported by anecdotal pointers to 
familiar experiences, assumptions require a formal 


1 Manski (2007) defines “external validity” as follows: “An experiment
is said to have ‘external validity’ if the distribution of outcomes
realized by a treatment group is the same as the distribution of
outcome that would be realized in an actual program.” Campbell and
Stanley (1963), page 5, take a slightly broader view: “‘External
validity’ asks the question of generalizability: to what populations,
settings, treatment variables, and measurement variables can this
effect be generalized?”

2 Hernan and VanderWeele (2011) studied such conditions in the context
of compound treatments, where we seek to predict the effect of one
version of a treatment from experiments with a different version.
Their analysis is a special case of the theory developed in this paper
(Petersen (2011)). A related application is reported in Robins,
Orellana and Rotnitzky (2008) where a treatment strategy is
extrapolated between two biologically similar populations under
different observational regimes.


language within which the notion “environment” (or 
“population”) is given precise characterization, and 
differences among environments can be encoded and 
analyzed. 

The advent of causal diagrams (Wright (1921); Heise (1975); Davis
(1984); Verma and Pearl (1988); Spirtes, Glymour and Scheines (1993);
Pearl (1995)) together with models of interventions (Haavelmo (1943);
Strotz and Wold (1960)) and counterfactuals (Neyman (1923); Rubin
(1974); Robins (1986); Balke and Pearl (1995)) provides such a
language and renders the formalization of transportability possible.

Armed with this language, this paper departs 
from the tradition of communicating “threats” and 
embarks instead on the task of formulating “licenses 
to transport,” namely, assumptions that, if they held 
true, would permit us to transport results across 
studies. 

In addition, the paper uses the inferential machinery of the
do-calculus (Pearl (1995); Koller and Friedman (2009); Huang and
Valtorta (2006); Shpitser and Pearl (2006)) to derive algorithms for
deciding whether transportability is feasible and how experimental and
observational findings can be combined to yield unbiased estimates of
causal effects in the target population.

The paper is organized as follows. In Section 2, we review the
foundations of structural equations modeling (SEM), the question of
identifiability and the do-calculus that emerges from these
foundations. (This section can be skipped by readers familiar with
these concepts and tools.) In Section 3, we motivate the question of
transportability through simple examples, and illustrate how the
solution depends on the causal story behind the problem. In Section 4,
we formally define the notion of transportability and reduce it to a
problem of symbolic transformations in do-calculus. In Section 5, we
provide a graphical criterion for deciding transportability and
estimating transported causal effects. We conclude in Section 6 with
brief discussions of related problems of external validity; these
include statistical transportability and meta-analysis.

2. PRELIMINARIES: THE LOGICAL FOUNDATIONS OF CAUSAL INFERENCE

The tools presented in this paper were developed in the context of
nonparametric Structural Equations Models (SEM), which is one among
several approaches to causal inference, and goes back to Haavelmo
(1943) and Strotz and Wold (1960). Other approaches include, for
example, potential-outcomes (Rubin (1974)), Structured Tree Graphs
(Robins (1986)), decision analytic (Dawid (2002)), Causal Bayesian
Networks (Spirtes, Glymour and Scheines (2000); Pearl (2000),
Chapter 1; Bareinboim, Brito and Pearl (2012)), and Settable Systems
(White and Chalak (2009)). We will first describe the generic features
common to all such approaches, and then summarize how these features
are represented in SEM. 3

2.1 Causal Models as Inference Engines 

From a logical viewpoint, causal analysis relies on causal assumptions
that cannot be deduced from (nonexperimental) data. Thus, every
approach to causal inference must provide a systematic way of
encoding, testing and combining these assumptions with data.
Accordingly, we view causal modeling as an inference engine that takes
three inputs and produces three outputs. The inputs are:

I-1. A set A of qualitative causal assumptions which the investigator
is prepared to defend on scientific grounds, and a model M_A that
encodes these assumptions mathematically. (In SEM, M_A takes the form
of a diagram or a set of unspecified functions. A typical assumption
is that no direct effect exists between a pair of variables (known as
exclusion restriction), or that an omitted factor, represented by an
error term, is independent of other such factors observed or
unobserved, known as well as unknown.)

I-2. A set Q of queries concerning causal or counterfactual
relationships among variables of interest. In linear SEM, Q concerned
the magnitudes of structural coefficients but, in general, Q may
address causal relations directly, for example:

Q_1: What is the effect of treatment X on outcome Y?

Q_2: Is this employer practicing gender discrimination?


3 We use the acronym SEM for both parametric and nonparametric
representations though, historically, SEM practitioners preferred the
former (Bollen and Pearl (2013)). Pearl (2011) has used the term
Structural Causal Models (SCM) to eliminate this confusion. While
comparisons of the various approaches lie beyond the scope of this
paper, we nevertheless propose that their merits be judged by the
extent to which each facilitates the functions described below.


In principle, each query Q_i ∈ Q should be “well defined,” that is,
computable from any fully specified model M compatible with A. (See
Definition 1 for formal characterization of a model, and also
Section 2.4 for the problem of identification in partially specified
models.)

I-3. A set D of experimental or non-experimental data, governed by a
joint probability distribution presumably consistent with A.

The outputs are: 

O-1. A set A* of statements which are the logical implications of A,
separate from the data at hand. For example, that X has no effect on Y
if we hold Z constant, or that Z is an instrument relative to {X, Y}.

O-2. A set C of data-dependent claims concerning the magnitudes or
likelihoods of the target queries in Q, each contingent on A. C may
contain, for example, the estimated mean and variance of a given
structural parameter, or the expected effect of a given intervention.
Auxiliary to C, a causal model should also yield an estimand Q_i(P)
for each query in Q, or a determination that Q_i is not identifiable
from P (Definition 2).

O-3. A list T of testable statistical implications of A (which may or
may not be part of O-2), and the degree g(T_i), T_i ∈ T, to which the
data agrees with each of those implications. A typical implication
would be a conditional independence assertion, or an equality
constraint between two probabilistic expressions. Testable constraints
should be read from the model M_A (see Definition 3), and used to
confirm or disconfirm the model against the data.

The structure of this inferential exercise is shown schematically in
Figure 1. For a comprehensive review on methodological issues, see
Pearl (2009a, 2012a).

2.2 Assumptions in Nonparametric Models 

A structural equation model (SEM) M is defined 
as follows. 

Definition 1 (Structural equation model (Pearl (2000), page 203)).

1. A set U of background or exogenous variables, representing factors
outside the model, which nevertheless affect relationships within the
model.

2. A set V = {V_1, ..., V_n} of endogenous variables, assumed to be
observable. Each of these variables is functionally dependent on some
subset PA_i of U ∪ V.

3. A set F of functions {f_1, ..., f_n} such that each f_i determines
the value of V_i ∈ V, v_i = f_i(pa_i, u).

4. A joint probability distribution P(u) over U.

Fig. 1. Causal analysis depicted as an inference engine converting
assumptions (A), queries (Q), and data (D) into logical implications
(A*), conditional claims (C), and data-fitness indices (g(T)).

A simple SEM model is depicted in Figure 2(a), which represents the
following three functions:

        z = f_Z(u_Z),
(2.1)   x = f_X(z, u_X),
        y = f_Y(x, u_Y),

where in this particular example, U_Z, U_X and U_Y are assumed to be
jointly independent but otherwise arbitrarily distributed. Whenever
dependence exists between any two exogenous variables, a bidirected
arrow will be added to the diagram to represent this dependence (e.g.,
Figure 4). 4 Each of these functions


Fig. 2. The diagrams associated with (a) the structural model of
equation (2.1) and (b) the modified model of equation (2.2),
representing the intervention do(X = x_0).


4 More precisely, the absence of bidirected arrows implies 
marginal independences relative of the respective exogenous 


represents a causal process (or mechanism) that determines the value
of the left variable (output) from the values on the right variables
(inputs), and is assumed to be invariant unless explicitly intervened
on. The absence of a variable from the right-hand side of an equation
encodes the assumption that nature ignores that variable in the
process of determining the value of the output variable. For example,
the absence of variable Z from the arguments of f_Y conveys the
empirical claim that variations in Z will leave Y unchanged, as long
as variables U_Y and X remain constant.

It is important to distinguish between a fully specified model in
which P(U) and the collection of functions F are specified and a
partially specified model, usually in the form of a diagram. The
former entails one and only one observational distribution P(V); the
latter entails a set of observational distributions P(V) that are
compatible with the graph (those that can be generated by specifying
(F, P(u))).

2.3 Representing Interventions, Counterfactuals 
and Causal Effects 

This feature of invariance permits us to derive powerful claims about
causal effects and counterfactuals, even in nonparametric models,
where all functions and distributions remain unknown. This is done
through a mathematical operator called do(x), which simulates physical
interventions by deleting certain functions from the model, replacing
them


variables. In other words, the set of all bidirected edges constitutes
an i-map of P(U) (Richardson (2003)).
with a constant X = x, while keeping the rest of the model unchanged
(Haavelmo (1943); Strotz and Wold (1960); Pearl (2014)). For example,
to emulate an intervention do(x_0) that sets X to a constant x_0 in
model M of Figure 2(a), the equation for x in equation (2.1) is
replaced by x = x_0, and we obtain a new model, M_{x_0},

        z = f_Z(u_Z),
(2.2)   x = x_0,
        y = f_Y(x, u_Y),

the graphical description of which is shown in Figure 2(b).

The joint distribution associated with this modified model, denoted
P(z, y | do(x_0)), describes the post-intervention distribution of
variables Y and Z (also called “controlled” or “experimental”
distribution), to be distinguished from the preintervention
distribution, P(x, y, z), associated with the original model of
equation (2.1). For example, if X represents a treatment variable, Y a
response variable, and Z some covariate that affects the amount of
treatment received, then the distribution P(z, y | do(x_0)) gives the
proportion of individuals that would attain response level Y = y and
covariate level Z = z under the hypothetical situation in which
treatment X = x_0 is administered uniformly to the population. 5

In general, we can formally define the postintervention distribution
by the equation

(2.3)   P_M(y | do(x)) = P_{M_x}(y).

In words, in the framework of model M, the postintervention
distribution of outcome Y is defined as the probability that model M_x
assigns to each outcome level Y = y. From this distribution, which is
readily computed from any fully specified model M, we are able to
assess treatment efficacy by comparing aspects of this distribution at
different levels of x_0. 6
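To make equations (2.2) and (2.3) concrete, the following is a minimal sketch of the do-operator on a fully specified binary instance of model (2.1). The particular functions (exclusive-or mechanisms), the probabilities in `P_U`, and the helper names `do` and `distribution` are illustrative assumptions of ours, not part of the theory; any other fully specified model could be substituted.

```python
from itertools import product

# Hypothetical P(U = 1) for the three independent exogenous variables.
P_U = {"uz": 0.4, "ux": 0.3, "uy": 0.2}

# A hypothetical fully specified instance of equation (2.1).
MODEL = {
    "z": lambda v, u: u["uz"],            # z = f_Z(u_Z)
    "x": lambda v, u: v["z"] ^ u["ux"],   # x = f_X(z, u_X)
    "y": lambda v, u: v["x"] ^ u["uy"],   # y = f_Y(x, u_Y)
}
ORDER = ("z", "x", "y")  # evaluation order consistent with the diagram


def do(model, **assignments):
    """Return the submodel M_x: each intervened equation becomes a constant."""
    modified = dict(model)
    for name, value in assignments.items():
        modified[name] = lambda v, u, value=value: value
    return modified


def distribution(model):
    """Exact joint P(z, x, y), obtained by enumerating all exogenous settings."""
    dist = {}
    for bits in product((0, 1), repeat=len(P_U)):
        u = dict(zip(P_U, bits))
        weight = 1.0
        for name, bit in u.items():
            weight *= P_U[name] if bit else 1.0 - P_U[name]
        v = {}
        for name in ORDER:
            v[name] = model[name](v, u)
        key = (v["z"], v["x"], v["y"])
        dist[key] = dist.get(key, 0.0) + weight
    return dist


pre = distribution(MODEL)             # preintervention P(z, x, y)
post = distribution(do(MODEL, x=1))   # postintervention P(z, x, y | do(X = 1))

# P(Y = 1 | do(X = 1)) is read directly off the modified model, as in (2.3).
p_y1_do_x1 = sum(p for (z, x, y), p in post.items() if y == 1)
# The covariate Z is untouched by the intervention on X.
p_z1_do_x1 = sum(p for (z, x, y), p in post.items() if z == 1)
```

Note that the intervention only replaces the equation for x; the mechanisms for z and y, and the distribution P(u), are reused unchanged, which is the invariance assumption at work.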


5 Equivalently, P(z, y | do(x_0)) can be interpreted as the joint
probability of (Z = z, Y = y) under a randomized experiment among
units receiving treatment level X = x_0. Readers versed in
potential-outcome notations may interpret P(y | do(x), z) as the
probability P(Y_x = y | Z_x = z), where Y_x is the potential outcome
under treatment X = x.

6 Counterfactuals are defined similarly through the equation
Y_x(u) = Y_{M_x}(u) (see Pearl (2009b), Chapter 7), but will not be
needed for the discussions in this paper.


2.4 Identification, d-Separation and Causal 
Calculus 

A central question in causal analysis is the question of
identification of causal queries (e.g., the effect of intervention
do(X = x_0)) from a combination of data and a partially specified
model, for example, when only the graph is given and neither the
functions F nor the distribution of U. In linear parametric settings,
the question of identification reduces to asking whether some model
parameter, β, has a unique solution in terms of the parameters of P
(say the population covariance matrix). In the nonparametric
formulation, the notion of “has a unique solution” does not directly
apply since quantities such as Q(M) = P(y | do(x)) have no parametric
signature and are defined procedurally by simulating an intervention
in a causal model M, as in equation (2.2). The following definition
captures the requirement that Q be estimable from the data:

Definition 2 (Identifiability). A causal query Q(M) is identifiable,
given a set of assumptions A, if for any two (fully specified) models,
M_1 and M_2, that satisfy A, we have 7

(2.4)   P(M_1) = P(M_2) ⇒ Q(M_1) = Q(M_2).

In words, the functional details of M_1 and M_2 do not matter; what
matters is that the assumptions in A (e.g., those encoded in the
diagram) would constrain the variability of those details in such a
way that equality of P’s would entail equality of Q’s. When this
happens, Q depends on P only, and should therefore be expressible in
terms of the parameters of P.
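Definition 2 can be illustrated by its failure. The sketch below constructs two hypothetical fully specified models over binary X and Y with a latent common cause U; both are compatible with a diagram containing the arrow X → Y and an unobserved confounder of X and Y, both generate the same observational distribution P(x, y), yet they disagree on Q = P(Y = 1 | do(X = 1)). The models `M1` and `M2` and the helper names are our own illustrative choices.

```python
P_U1 = 0.5  # hypothetical P(U = 1), shared by both models

# Model 1: U drives both X and Y; X has no effect on Y.
M1 = {"f_x": lambda u: u, "f_y": lambda x, u: u}
# Model 2: U drives X, and Y simply copies X.
M2 = {"f_x": lambda u: u, "f_y": lambda x, u: x}


def observational(model):
    """Exact P(x, y) generated by the model."""
    dist = {}
    for u in (0, 1):
        p_u = P_U1 if u else 1.0 - P_U1
        x = model["f_x"](u)
        y = model["f_y"](x, u)
        dist[(x, y)] = dist.get((x, y), 0.0) + p_u
    return dist


def effect(model, x0):
    """P(Y = 1 | do(X = x0)): hold x at x0 and average f_Y over P(u)."""
    return sum((P_U1 if u else 1.0 - P_U1) * model["f_y"](x0, u) for u in (0, 1))


same_p = observational(M1) == observational(M2)   # equality of P's
q1, q2 = effect(M1, 1), effect(M2, 1)             # the two values of Q
```

Here `same_p` holds while `q1` and `q2` differ, so implication (2.4) fails and the query is not identifiable under these assumptions; no amount of observational data could distinguish the two models.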

When a query Q is given in the form of a do-expression, for example,
Q = P(y | do(x), z), its identifiability can be decided systematically
using an algebraic procedure known as the do-calculus (Pearl (1995)).
It consists of three inference rules that permit us to map
interventional and observational distributions whenever certain
conditions hold in the causal diagram G.

The conditions that permit the application of these inference rules
can be read off the diagrams using

7 An implication similar to (2.4) is used in the standard statistical
definition of parameter identification, where it conveys the
uniqueness of a parameter set θ given a distribution P_θ (Lehmann and
Casella (1998)). To see the connection, one should think about the
query Q = P(y | do(x)) as a function Q = g(θ) where θ is the pair
F ∪ P(u) that characterizes a fully specified model M.
a graphical criterion known as d-separation (Pearl (1988)).

Definition 3 (d-separation). A set S of nodes is said to block a path
p if either

1. p contains at least one arrow-emitting node that is in S, or

2. p contains at least one collision node that is outside S and has no
descendant in S.

If S blocks all paths from set X to set Y, it is said to “d-separate X
and Y,” and then, it can be shown that variables X and Y are
independent given S, written X ⊥⊥ Y | S. 8

D-separation reflects conditional independencies that hold in any
distribution P(v) that is compatible with the causal assumptions A
embedded in the diagram. To illustrate, the path U_Z → Z → X → Y in
Figure 2(a) is blocked by S = {Z} and by S = {X}, since each emits an
arrow along that path. Consequently, we can infer that the conditional
independencies U_Z ⊥⊥ Y | Z and U_Z ⊥⊥ Y | X will be satisfied in any
probability function that this model can generate, regardless of how
we parameterize the arrows. Likewise, the path U_Z → Z → X ← U_X is
blocked by the null set {∅}, but it is not blocked by S = {Y} since Y
is a descendant of the collision node X. Consequently, the marginal
independence U_Z ⊥⊥ U_X will hold in the distribution, but
U_Z ⊥⊥ U_X | Y may or may not hold. 9
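The four separation claims above can be checked mechanically. The sketch below does not enumerate paths as Definition 3 does; it uses the equivalent ancestral-moral-graph test (restrict to ancestors, connect co-parents, drop directions, delete S, and test connectivity). The function name `d_separated` and the parent-map encoding of Figure 2(a) are our own illustrative choices.

```python
def d_separated(parents, xs, ys, zs):
    """Return True if zs d-separates xs from ys in the DAG given by `parents`.

    `parents` maps each node to an iterable of its parents; xs, ys and zs
    are assumed to be disjoint sets of node names.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # Step 1: keep only xs, ys, zs and their ancestors.
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents.get(node, ()))

    # Step 2: moralize: link each node to its parents and marry co-parents.
    neighbors = {node: set() for node in relevant}
    for child in relevant:
        pas = list(parents.get(child, ()))
        for i, p in enumerate(pas):
            neighbors[child].add(p)
            neighbors[p].add(child)
            for q in pas[i + 1:]:
                neighbors[p].add(q)
                neighbors[q].add(p)

    # Step 3: undirected search from xs that never enters a node of zs.
    seen, stack = set(), list(xs)
    while stack:
        node = stack.pop()
        if node in seen or node in zs:
            continue
        if node in ys:
            return False
        seen.add(node)
        stack.extend(neighbors[node])
    return True


# Figure 2(a), with the exogenous variables drawn explicitly.
FIG_2A = {"Z": ["U_Z"], "X": ["Z", "U_X"], "Y": ["X", "U_Y"]}

blocked_by_z = d_separated(FIG_2A, {"U_Z"}, {"Y"}, {"Z"})
blocked_by_x = d_separated(FIG_2A, {"U_Z"}, {"Y"}, {"X"})
marginal = d_separated(FIG_2A, {"U_Z"}, {"U_X"}, set())
given_y = d_separated(FIG_2A, {"U_Z"}, {"U_X"}, {"Y"})
```

The first three queries report separation, while the last does not, reflecting that conditioning on Y, a descendant of the collider X, opens the path between U_Z and U_X.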

2.5 The Rules of do-Calculus 

Let X, Y, Z and W be arbitrary disjoint sets of nodes in a causal DAG
G. We denote by G_{\overline{X}} the graph obtained by deleting from G
all arrows pointing to nodes in X. Likewise, we denote by
G_{\underline{X}} the graph obtained by deleting from G all arrows
emerging from nodes in X. To represent the deletion of both incoming
and outgoing arrows, we use the notation
G_{\overline{X}\underline{Z}}.
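The two graph surgeries just defined amount to simple filters on an edge list. The following sketch, with helper names of our own choosing and the observable part of Figure 2(a) as a toy input, shows one way to build the mutilated graphs in which the conditions of the rules below are evaluated.

```python
def remove_incoming(edges, nodes):
    """Delete every arrow pointing into a node of `nodes` (the overline operation)."""
    return {(a, b) for (a, b) in edges if b not in nodes}


def remove_outgoing(edges, nodes):
    """Delete every arrow emerging from a node of `nodes` (the underline operation)."""
    return {(a, b) for (a, b) in edges if a not in nodes}


# Observable part of Figure 2(a): Z -> X -> Y, as (tail, head) pairs.
G = {("Z", "X"), ("X", "Y")}

g_over_x = remove_incoming(G, {"X"})    # only X -> Y remains
g_under_x = remove_outgoing(G, {"X"})   # only Z -> X remains
# Incoming arrows of X and outgoing arrows of Z removed together.
g_over_x_under_z = remove_outgoing(remove_incoming(G, {"X"}), {"Z"})
```

Representing the graph as a set of directed pairs keeps both operations order-independent, so the combined surgery can be composed in either order.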

The following three rules are valid for every interventional
distribution compatible with G:


8 See Hayduk et al. (2003), Glymour and Greenland (2008) and Pearl
(2009b), page 335, for a gentle introduction to d-separation.

9 This special handling of collision nodes (or colliders, e.g.,
Z → X ← U_X) reflects a general phenomenon known as Berkson’s paradox
(Berkson (1946)), whereby observations on a common consequence of two
independent causes render those causes dependent. For example, the
outcomes of two independent coins are rendered dependent by the
testimony that at least one of them is a tail.


Rule 1 (Insertion/deletion of observations).

(2.5)   P(y | do(x), z, w) = P(y | do(x), w)
            if (Y ⊥⊥ Z | X, W)_{G_{\overline{X}}}.

Rule 2 (Action/observation exchange).

(2.6)   P(y | do(x), do(z), w) = P(y | do(x), z, w)
            if (Y ⊥⊥ Z | X, W)_{G_{\overline{X}\underline{Z}}}.

Rule 3 (Insertion/deletion of actions).

(2.7)   P(y | do(x), do(z), w) = P(y | do(x), w)
            if (Y ⊥⊥ Z | X, W)_{G_{\overline{X}\overline{Z(W)}}},

where Z(W) is the set of Z-nodes that are not ancestors of any W-node
in G_{\overline{X}}.

To establish identifiability of a query Q, one needs to repeatedly
apply the rules of do-calculus to Q, until the final expression no
longer contains a do-operator; 10 this renders it estimable from
nonexperimental data. The do-calculus was proven to be complete for
the identifiability of causal effects in the form Q = P(y | do(x), z)
(Shpitser and Pearl (2006); Huang and Valtorta (2006)), which means
that if Q cannot be expressed in terms of the probability of
observables P by repeated application of these three rules, such an
expression does not exist. In other words, the query is not estimable
from observational studies without making further assumptions, for
example, linearity, monotonicity, additivity, absence of
interactions, etc.
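As a small worked instance of this elimination process, consider the diagram of Figure 2(a), in which the exogenous variables are mutually independent. The derivation below is our own illustration: each line applies a single rule with the intervened set taken to be empty, the rule's Z-set instantiated to X, and W empty.

```latex
\begin{align*}
P(y \mid do(x)) &= P(y \mid x)
  && \text{Rule 2: } (Y \perp\!\!\!\perp X)_{G_{\underline{X}}},
  \text{ since deleting } X \to Y \text{ leaves no path from } X \text{ to } Y, \\
P(z \mid do(x)) &= P(z)
  && \text{Rule 3: } (Z \perp\!\!\!\perp X)_{G_{\overline{X}}},
  \text{ since deleting } Z \to X \text{ leaves no path from } Z \text{ to } X.
\end{align*}
```

Both right-hand sides are free of do-operators, so in this unconfounded model both queries are identifiable from the observational distribution alone.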

We shall see that, to establish transportability, the goal will be
different; instead of eliminating do-operators from the query
expression, we will need to separate them from a set of variables S
that represent disparities between populations.

3. INFERENCE ACROSS POPULATIONS: MOTIVATING EXAMPLES

To motivate the treatment of Section 4, we first demonstrate some of
the subtle questions that transportability entails through three
simple examples, informally depicted in Figure 3.

Example 1. Consider the graph in Figure 3(a) that represents
cause-effect relationships in the pretreatment population in Los
Angeles. We conduct

10 Such derivations are illustrated in graphical details in 
Pearl (2009b), page 87. 
Fig. 3. Causal diagrams depicting Examples 1-3. In (a), Z represents
“age.” In (b), Z represents “linguistic skills” while age (in hollow
circle) is unmeasured. In (c), Z represents a biological marker
situated between the treatment (X) and a disease (Y).


a randomized trial in Los Angeles and estimate the causal effect of
exposure X on outcome Y for every age group Z = z. 11,12 We now wish
to generalize the results to the population of New York City (NYC),
but data alert us to the fact that the study distribution P(x, y, z)
in LA is significantly different from the one in NYC (call the latter
P*(x, y, z)). In particular, we notice that the average age in NYC is
significantly higher than that in LA. How are we to estimate the
causal effect of X on Y in NYC, denoted P*(y | do(x))?

Our natural inclination would be to assume that age-specific effects
are invariant across cities and so, if the LA study provides us with
(estimates of) age-specific causal effects P(y | do(x), Z = z), the
overall causal effect in NYC should be

(3.1)   P*(y | do(x)) = \sum_{z} P(y | do(x), z) P*(z).

This transport formula combines experimental results obtained in LA,
P(y | do(x), z), with observational aspects of NYC population, P*(z),
to obtain an experimental claim P*(y | do(x)) about NYC. 13
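Numerically, equation (3.1) is a weighted average of stratum-specific effects, with the weights taken from the target population. The sketch below uses two age strata and entirely hypothetical numbers; the function name `transport_by_reweighting` is our own, and the formula is applied here without yet asking whether the assumptions that license it hold.

```python
def transport_by_reweighting(effect_by_stratum, target_weights):
    """Evaluate equation (3.1): sum over z of P(y | do(x), z) * P*(z)."""
    if set(effect_by_stratum) != set(target_weights):
        raise ValueError("strata of the study and the target must coincide")
    if abs(sum(target_weights.values()) - 1.0) > 1e-9:
        raise ValueError("target weights must sum to one")
    return sum(effect_by_stratum[z] * target_weights[z] for z in effect_by_stratum)


# Hypothetical age-specific effects P(Y = 1 | do(X = 1), z) from the LA trial.
la_effects = {"young": 0.20, "old": 0.60}
# Hypothetical age distributions: P(z) in LA and P*(z) in NYC.
la_age = {"young": 0.70, "old": 0.30}
nyc_age = {"young": 0.30, "old": 0.70}

la_effect = transport_by_reweighting(la_effects, la_age)     # 0.2*0.7 + 0.6*0.3
nyc_effect = transport_by_reweighting(la_effects, nyc_age)   # 0.2*0.3 + 0.6*0.7
```

With these numbers the overall effect rises from 0.32 in LA to 0.48 in NYC, purely because the older stratum, in which the treatment is assumed more effective, carries more weight in the target population.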

Our first task in this paper will be to explicate the assumptions that
render this extrapolation valid.


11 Throughout the paper, each graph represents the causal 
structure of the population prior to the treatment, hence X 
stands for the level of treatment taken by an individual out 
of free choice. 

12 The arrow from Z to X represents the tendency of older 
people to seek treatment more often than younger people, 
and the arrow from Z to Y represents the effect of age on the 
outcome. 

13 At first glance, equation (3.1) may be regarded as a routine
application of “standardization” or “recalibration,” a statistical
extrapolation method that can be traced back to a century-old
tradition in demography and political arithmetic (Westergaard (1916);
Yule (1934); Lane and Nelder (1982)). On a second thought, it raises
the deeper question of why we consider age-specific effects to be
invariant across populations. See discussion following Example 2.


We ask, for example, what must we assume about other confounding
variables besides age, both latent and observed, for equation (3.1) to
be valid, or, would the same transport formula hold if Z was not age,
but some proxy for age, say, language proficiency. More intricate yet,
what if Z stood for an exposure-dependent variable, say hypertension
level, that stands between X and Y?

Let us examine the proxy issue first. 

Example 2. Let the variable Z in Example 1 stand for subjects’
language proficiency, and let us assume that Z does not affect
exposure (X) or outcome (Y), yet it correlates with both, being a
proxy for age which is not measured in either study [see Figure 3(b)].
Given the observed disparity P(z) ≠ P*(z), how are we to estimate the
causal effect P*(y | do(x)) for the target population of NYC from the
z-specific causal effect P(y | do(x), z) estimated at the study
population of LA?

The inequality P(z) ≠ P*(z) in this example may reflect either age
difference or differences in the way that Z correlates with age. If
the two cities enjoy identical age distributions and NYC residents
acquire linguistic skills at a younger age, then since Z has no effect
whatsoever on X and Y, the inequality P(z) ≠ P*(z) can be ignored and,
intuitively, the proper transport formula would be

(3.2)   P*(y | do(x)) = P(y | do(x)).

If, on the other hand, the conditional probabilities P(z | age) and
P*(z | age) are the same in both cities, and the inequality
P(z) ≠ P*(z) reflects genuine age differences, equation (3.2) is no
longer valid, since the age difference may be a critical factor in
determining how people react to X. We see, therefore, that the choice
of the proper transport formula depends on the causal context in which
population differences are embedded.
This example also demonstrates why the invariance of Z-specific causal
effects should not be taken for granted. While justified in Example 1,
with Z = age, it fails in Example 2, in which Z was equated with
“language skills.” Indeed, using Figure 3(b) for guidance, the
Z-specific effect of X on Y in NYC is given by

P*(y | do(x), z)
    = \sum_{age} P*(y | do(x), z, age) P*(age | do(x), z)
    = \sum_{age} P*(y | do(x), age) P*(age | z)
    = \sum_{age} P(y | do(x), age) P*(age | z).

Thus, if the two populations differ in the relation between age and
skill, that is,

P(age | z) ≠ P*(age | z),

the skill-specific causal effect would differ as well.

The intuition is clear. A NYC person at skill level Z = z is likely to
be in a totally different age group from his skill-equals in Los
Angeles and, since it is age, not skill, that shapes the way
individuals respond to treatment, it is only reasonable that Los
Angeles residents would respond differently to treatment than their
NYC counterparts at the very same skill level.

The essential difference between Examples 1 and 2 is that age is
normally taken to be an exogenous variable (not assigned by other
factors in the model) while skills may be indicative of earlier
factors (age, education, ethnicity) capable of modifying the causal
effect. Therefore, conditional on skill, the effect may be different
in the two populations.

Example 3. Examine the case where Z is an X-dependent variable, say a
disease bio-marker, standing on the causal pathways between X and Y as
shown in Figure 3(c). Assume further that the disparity
P(z | x) ≠ P*(z | x) is discovered and that, again, both the average
and the z-specific causal effect P(y | do(x), z) are estimated in the
LA experiment, for all levels of X and Z. Can we, based on information
given, estimate the average (or z-specific) causal effect in the
target population of NYC?

Here, equation (3.1) is wrong because the overall causal effect (in
both LA and NYC) is no longer a simple average of the z-specific
causal effects. The correct weighing rule is

(3.3)   P*(y | do(x)) = \sum_{z} P*(y | do(x), z) P*(z | do(x)),

which reduces to (3.1) only in the special case where Z is unaffected
by X. Equation (3.2) is also wrong because we can no longer argue, as
we did in Example 2, that Z does not affect Y, hence it can be
ignored. Here, Z lies on the causal pathway between X and Y so,
clearly, it affects their relationship. What then is the correct
transport formula for this scenario?

To cast this example in a more realistic setting, let 
us assume that we wish to use Z as a “surrogate end¬ 
point” to predict the efficacy of treatment X on out¬ 
come Y, where Y is too difficult and/or expensive to 
measure routinely (Prentice (1989); Ellenberg and 
Hamilton (1989)). Thus, instead of considering ex¬ 
perimental and observational studies conducted at 
two different locations, we consider two such studies 
taking place at the same location, but at different 
times. In the first study, we measure P(y,z |do(x)) 
and discover that Z is a good surrogate, namely, 
knowing the effect of treatment on Z allows predic¬ 
tion of the effect of treatment on the more clinically 
relevant outcome (Y) (Joffe and Greene (2009)). 
Once Z is proclaimed a “surrogate endpoint,” it in¬ 
vites efforts to find direct means of controlling Z. 
For example, if cholesterol level is found to be a pre¬ 
dictor of heart diseases in a long-run trial, drug man¬ 
ufacturers would rush to offer cholesterol-reducing 
substances for public consumption. As a result, both
the prior P(z) and the treatment-dependent probability
P(z|do(x)) would undergo a change, resulting
in P*(z) and P*(z|do(x)), respectively.

We now wish to reassess the effect of the drug 
P*(y | do(x)) in the new population and do it in the 
cheapest possible way, namely, by conducting an ob¬ 
servational study to estimate P*(z,x), acknowledg¬ 
ing that confounding exists between X and Y and 
that the drug affects Y both directly and through 
Z, as shown in Figure 3(c). 

Using a graphical representation to encode the as¬ 
sumptions articulated thus far, and further assum¬ 
ing that the disparity observed stems only from a 
difference in people’s susceptibility to X (and not 
due to a change in some unobservable confounder), 
we will prove in Section 5 that the correct transport
formula should be

$$P^*(y \mid do(x)) = \sum_z P(y \mid do(x), z)\, P^*(z \mid x), \tag{3.4}$$

which is different from both (3.1) and (3.2). It calls
instead for the z-specific effects to be reweighted by
the conditional probability P*(z|x), estimated in the
target population. 14
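As a minimal numerical sketch of equation (3.4), the Python function below combines z-specific experimental effects from the study population with the conditional distribution P*(z|x) observed in the target population. The function name `transport_via_surrogate`, the dictionary layout and all numbers are illustrative assumptions, not quantities taken from the paper.

```python
def transport_via_surrogate(p_y_do_x_z, p_star_z_given_x, x):
    """Evaluate sum_z P(y=1 | do(x), z) * P*(z | x) for a binary outcome Y.

    p_y_do_x_z[(x, z)]       : P(y=1 | do(x), z), from the experimental study
    p_star_z_given_x[(z, x)] : P*(z | x), from the target observational study
    """
    zs = sorted({z for (xx, z) in p_y_do_x_z if xx == x})
    total = sum(p_star_z_given_x[(z, x)] for z in zs)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("P*(z | x) must sum to 1 over z")
    return sum(p_y_do_x_z[(x, z)] * p_star_z_given_x[(z, x)] for z in zs)


# Hypothetical inputs (binary X and Z), chosen only for illustration.
p_y_do_x_z = {(0, 0): 0.10, (0, 1): 0.30, (1, 0): 0.20, (1, 1): 0.60}
p_star_z_given_x = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.5, (1, 1): 0.5}

effect_x0 = transport_via_surrogate(p_y_do_x_z, p_star_z_given_x, x=0)
effect_x1 = transport_via_surrogate(p_y_do_x_z, p_star_z_given_x, x=1)
```

Only the weights come from the target population; the z-specific effects are reused unchanged from the experiment, which is what the formula licenses under the assumptions stated above.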

To see how the transportability problem fits into 
the general scheme of causal analysis discussed in 
Section 2.1 (Figure 1), we note that, in our case, the 
data comes from two sources, experimental (from 
the study) and nonexperimental (from the target), 
assumptions are encoded in the form of selection di¬ 
agrams, and the query stands for the causal effect 
(e.g., P*(y|do(x))). Although this paper does not
discuss the goodness-of-fit problem, standard meth¬ 
ods are available for testing the compatibility of the 
selection diagram with the data available. 

4. FORMALIZING TRANSPORTABILITY 
4.1 Selection Diagrams and Selection Variables 

The pattern that emerges from the examples dis¬ 
cussed in Section 3 indicates that transportability 
is a causal, not statistical notion. In other words, 
the conditions that license transport as well as the 
formulas through which results are transported de¬ 
pend on the causal relations between the variables 
in the domain, not merely on their statistics. For 
instance, it was important in Example 3 to ascertain
that the change in P(z|x) was due to the
change in the way Z is affected by X, but not due
to a change in confounding conditions between the
two. This cannot be determined solely by comparing
P(z|x) and P*(z|x). If X and Z are confounded
[e.g., Figure 6(e)], it is quite possible for the inequality
P(z|x) ≠ P*(z|x) to hold, reflecting differences
in confounding, while the way that Z is affected
by X (i.e., P(z|do(x))) is the same in the
two populations; a different transport formula will
then emerge for this case.

Consequently, licensing transportability requires
knowledge of the mechanisms, or processes, through
which population differences come about; different
localization of these mechanisms yields different
transport formulae. This can be seen most vividly in
Example 2 [Figure 3(b)] where we reasoned that no
reweighting is necessary if the disparity P(z) ≠ P*(z)
originates with the way language proficiency depends
on age, while the age distribution itself remains
the same. Yet, because age is not measured,
this condition cannot be detected in the probability
distribution P, and cannot be distinguished from an
alternative condition,

$$P(\text{age}) \neq P^*(\text{age}) \quad \text{and} \quad P(z \mid \text{age}) = P^*(z \mid \text{age}),$$

one that may require reweighting according to equation
(3.1). In other words, every probability distribution
P(x, y, z) that is compatible with the process
of Figure 3(b) is also compatible with that of Figure
3(a) and, yet, the two processes dictate different
transport formulas.

14 Quite often the possibility of running a second randomized
experiment to estimate P*(z|do(x)) is also available to
investigators, though at a higher cost. In such cases, a transport
formula would be derivable under more relaxed assumptions,
for example, allowing for X and Z to be confounded.

Based on these observations, it is clear that if we 
are to represent formally the differences between 
populations (similarly, between experimental set¬ 
tings or environments), we must resort to a represen¬ 
tation in which the causal mechanisms are explicitly 
encoded and in which differences in populations are 
represented as local modifications of those mecha¬ 
nisms. 

To this end, we will use causal diagrams aug¬ 
mented with a set, S, of “selection variables,” where 
each member of S corresponds to a mechanism by 
which the two populations differ, and switching be¬ 
tween the two populations will be represented by 
conditioning on different values of these S vari¬ 
ables. 15 

Intuitively, if P(v|do(x)) stands for the distribution
of a set V of variables in the experimental
study (with X randomized) then we designate by
P*(v|do(x)) the distribution of V if we were to conduct
the study on population Π* instead of Π. We
now attribute the difference between the two to the
action of a set S of selection variables, and write 16,17

$$P^*(v \mid do(x)) = P(v \mid do(x), s^*).$$

15 Disparities among populations or subpopulations can also
arise from differences in design; for example, if two samples are
drawn by different criteria from a given population. The problem
of generalizing between two such subpopulations is usually
called sampling selection bias (Heckman (1979); Hernán,
Hernández-Díaz and Robins (2004); Cole and Stuart (2010);
Pearl (2013); Bareinboim, Tian and Pearl (2014)). In this paper,
we deal only with nature-induced, not man-made disparities.

Fig. 4. Selection diagrams depicting specific versions of Examples 1-3. In (a), the two populations differ in age distributions.
In (b), the populations differ in how Z depends on age (an unmeasured variable, represented by the hollow circle) and the
age distributions are the same. In (c), the populations differ in how Z depends on X. In all diagrams, dashed arcs (e.g.,
X <--> Y) represent the presence of latent variables affecting both X and Y.

The selection variables in S may represent all fac¬ 
tors by which populations may differ or that may 
“threaten” the transport of conclusions between 
populations. For example, in Figure 4(a) the age disparity
P(z) ≠ P*(z) discussed in Example 1 will be
represented by the inequality

$$P(z) \neq P(z \mid s),$$

where S stands for all factors responsible for draw¬ 
ing subjects at age Z = z to NYC rather than LA. 

Of equal importance is the absence of an S vari¬ 
able pointing to Y in Figure 4(a), which encodes 
the assumption that age-specific effects are invari¬ 
ant across the two populations. 

This graphical representation, which we will call
a “selection diagram,” is defined as follows: 18


16 Alternatively, one can represent the two populations’ distributions
by P(v|do(x), s) and P(v|do(x), s*), respectively.
The results, however, will be the same, since only the location
of S enters the analysis.

17 Pearl (1993, 2009b, page 71), Spirtes, Glymour and
Scheines (1993) and Dawid (2002), for example, use conditioning
on auxiliary variables to switch between experimental
and observational studies. Dawid (2002) further uses such
variables to represent changes in parameters of probability
distributions.

18 The assumption that there are no structural changes between
domains can be relaxed starting with D = G* and
adding S-nodes following the same procedure as in Definition 4,
while enforcing acyclicity. In extreme cases in which
the two domains differ in causal directionality (Spirtes, Glymour
and Scheines (2000), pages 298-299), acyclicity cannot
be maintained. This complication, as well as one created when
G is an edge-superset of G*, requires a more elaborate graphical
representation and lies beyond the scope of this paper.


Definition 4 (Selection diagram). Let (M, M*)
be a pair of structural causal models (Definition 1)
relative to domains (Π, Π*), sharing a causal diagram
G. (M, M*) is said to induce a selection diagram
D if D is constructed as follows:

1. Every edge in G is also an edge in D.

2. D contains an extra edge S_i → V_i whenever there
might exist a discrepancy f_i ≠ f_i* or P(U_i) ≠
P*(U_i) between M and M*.

In summary, the S-variables locate the mechanisms
where structural discrepancies between the
two populations are suspected to take place. Alter¬ 
natively, the absence of a selection node pointing to 
a variable represents the assumption that the mech¬ 
anism responsible for assigning value to that vari¬ 
able is the same in the two populations. In the ex¬ 
treme case, we could add selection nodes to all vari¬ 
ables, which means that we have no reason to be¬ 
lieve that the populations share any mechanism in 
common, and this, of course would inhibit any ex¬ 
change of information among the populations. The 
invariance assumptions between populations, as we 
will see, will open the door for the transport of some 
experimental findings. 
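The construction in Definition 4 can be sketched in a few lines of Python. The function name `build_selection_diagram`, the edge-list encoding and the `S_` naming of selection nodes are assumptions made for illustration; the example graph is a hypothetical three-variable diagram, not a reproduction of any figure in the paper.

```python
def build_selection_diagram(causal_edges, differing_variables):
    """Return the edge set of a selection diagram D.

    causal_edges        : iterable of (parent, child) pairs of the shared diagram G
    differing_variables : variables V_i whose mechanism f_i or P(U_i) may differ
    """
    causal_edges = list(causal_edges)
    nodes = {n for edge in causal_edges for n in edge}
    unknown = set(differing_variables) - nodes
    if unknown:
        raise ValueError(f"variables not in G: {sorted(unknown)}")
    edges = set(causal_edges)              # rule 1: every edge of G is kept
    for v in differing_variables:          # rule 2: add an edge S_v -> v
        edges.add((f"S_{v}", v))
    return edges


# Hypothetical shared diagram G: Z -> X, Z -> Y, X -> Y.
G = [("Z", "X"), ("Z", "Y"), ("X", "Y")]

# Only the mechanism of Z is suspected to differ between the populations.
D_z_only = build_selection_diagram(G, ["Z"])

# Extreme case: a selection node on every variable (no shared mechanism assumed).
D_saturated = build_selection_diagram(G, ["X", "Y", "Z"])
```

The informative content of a selection diagram lies in the selection nodes that are absent: `D_z_only` asserts invariance of the mechanisms of X and Y, while `D_saturated` asserts nothing.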

For clarity, we will represent the S variables by 
squares, as in Figure 4, which uses selection dia¬ 
grams to encode the three examples discussed in 
Section 3. (Besides the S variables, these graphs 
also include additional latent variables, represented 
by bidirected edges, which makes the examples more 
realistic.) In particular, Figures 4(a) and 4(b) represent,
respectively, two different mechanisms responsible
for the observed disparity P(z) ≠ P*(z). The
first [Figure 4(a)] dictates transport formula (3.1), 
while the second [Figure 4(b)] calls for direct, un¬ 
adjusted transport (3.2). This difference stems from 
the location of the S variables in the two diagrams.
In Figure 4(a), the S variable represents unspecified 
factors that cause age differences between the two 
populations, while in Figure 4(b), S represents fac¬ 
tors that cause differences in reading skills (Z) while 
the age distribution itself (unobserved) remains the 
same. 

In this paper, we will address the issue of trans¬ 
portability assuming that scientific knowledge about 
invariance of certain mechanisms is available and 
encoded in the selection diagram through the S 
nodes. Such knowledge is, admittedly, more de¬ 
manding than that which shapes the structure of 
each causal diagram in isolation. It is, however, a 
prerequisite for any attempt to justify transfer of 
findings across populations, which makes selection 
diagrams a mathematical object worthy of analysis. 

4.2 Transportability: Definitions and Examples 

Using selection diagrams as the basic represen¬ 
tational language, and harnessing the concepts of 
intervention, do-calculus, and identifiability (Sec¬ 
tion 2), we can now give the notion of transporta¬ 
bility a formal definition. 

Definition 5 (Transportability). Let D be a
selection diagram relative to domains (Π, Π*). Let
(P, I) be the pair of observational and interventional
distributions of Π, and P* be the observational
distribution of Π*. The causal relation R(Π*) =
P*(y|do(x), z) is said to be transportable from Π
to Π* in D if R(Π*) is uniquely computable from
P, P*, I in any model that induces D.

Two interesting connections between identifiabil¬ 
ity and transportability are worth noting. First, note 
that all identifiable causal relations in D are also 
transportable, because they can be computed directly
from P* and require no experimental information
from Π. Second, note that given causal diagram
G, one can produce a selection diagram D such that
identifiability in G is equivalent to transportability
in D. First set D = G, and then add selection nodes
pointing to all variables in D, which represents that
the target domain does not share any mechanism 
with its counterpart—this is equivalent to the prob¬ 
lem of identifiability because the only way to achieve 
transportability is to identify R from scratch in the 
target population. 

While the problems of identifiability and trans¬ 
portability are related, proofs of nontransportability 
are more involved than those of nonidentifiability for
they require one to demonstrate the existence of
two competing models compatible with D, agreeing
on {P, P*, I}, and disagreeing on R(Π*).

Definition 5 is declarative, and does not offer 
an effective method of demonstrating transportabil¬ 
ity even in simple models. Theorem 1 offers such 
a method using a sequence of derivations in do- 
calculus. 

Theorem 1. Let D be the selection diagram
characterizing two populations, Π and Π*, and S
a set of selection variables in D. The relation R
= P*(y|do(x), z) is transportable from Π to Π* if
the expression P(y|do(x), z, s) is reducible, using the
rules of do-calculus, to an expression in which S
appears only as a conditioning variable in do-free
terms.

Proof. Every relation satisfying the condition 
of Theorem 1 can be written as an algebraic com¬ 
bination of two kinds of terms, those that involve 
S and those that do not. The former can be writ¬ 
ten as P*-terms and are estimable, therefore, from 
observations on Π*, as required by Definition 5. All
other terms, especially those involving do-operators,
do not contain S; they are therefore experimentally
identifiable in Π. □

This criterion was proven to be both sufficient and
necessary for causal effects, namely R = P*(y|do(x))
(Bareinboim and Pearl (2012)). Theorem 1, though
procedural, does not specify the sequence of rules
leading to the needed reduction when such a sequence
exists. Bareinboim and Pearl (2013b) derived
a complete procedural solution for this, based
on graphical methods developed in Tian and Pearl
(2002) and Shpitser and Pearl (2006). Despite its completeness,
however, the procedural solution is not
trivial, and we take here an alternative route to es¬ 
tablish a simple and transparent procedure for con¬ 
firming transportability, guided by two recognizable 
subgoals. 

Definition 6 (Trivial transportability). A causal
relation R is said to be trivially transportable from
Π to Π*, if R(Π*) is identifiable from (G*, P*).

This criterion amounts to an ordinary test of identifiability
of causal relations using graphs, as given
by Definition 2. It permits us to estimate R(Π*) directly
from observational studies on Π*, unaided by
causal information from Π.

Example 4. Let R be the causal effect P*(y|do(x))
and let the selection diagram of Π and Π*
be given by X → Y ← S; then R is trivially transportable,
since R(Π*) = P*(y|x).

Another special case of transportability occurs 
when a causal relation has identical form in both 
domains—no recalibration is needed. 

Definition 7 (Direct transportability). A causal
relation R is said to be directly transportable from
Π to Π*, if R(Π*) = R(Π).

A graphical test for direct transportability of R =
P*(y|do(x), z) follows from do-calculus and reads:
$(S \perp\!\!\!\perp Y \mid X, Z)_{G_{\overline{X}}}$; in words, X blocks all paths from
S to Y once we remove all arrows pointing to X and
condition on Z. As a concrete example, this test is
satisfied in Figure 4(a) and, therefore, the z-specific
effects are the same in both populations; they are directly
transportable.

Remark. The notion of “external validity” as 
defined by Manski (2007) (footnote 1) corresponds 
to Direct Transportability, for it requires that R re¬ 
tains its validity without adjustment, as in equation 
(3.2). Such conditions preclude the use of information
from Π* to recalibrate R.

Example 5. Let R be the causal effect of X 
on Y, and let D have a single S node pointing to 
X, then R is directly transportable, because causal 
effects are independent of the selection mechanism 
(see Pearl (2009b), pages 72 and 73). 

Example 6. Let R be the z-specific causal effect
of X on Y, P*(y|do(x), z), where Z is a set
of variables, and P and P* differ only in the conditional
probabilities P(z|pa(Z)) and P*(z|pa(Z))
such that $(Z \perp\!\!\!\perp Y \mid pa(Z))$, as shown in Figure 4(b).
Under these conditions, R is not directly transportable.
However, the pa(Z)-specific causal effects
P*(y|do(x), pa(Z)) are directly transportable, and
so is P*(y|do(x)). Note that, due to the confounding
arcs, none of these quantities is identifiable.

5. TRANSPORTABILITY OF CAUSAL 
EFFECTS—A GRAPHICAL CRITERION 

We now state and prove two theorems that permit 
us to decide algorithmically, given a selection dia¬ 
gram, whether a relation is transportable between 
two populations, and what the transport formula 
should be. 


Theorem 2. Let D be the selection diagram
characterizing two populations, Π and Π*, and S
the set of selection variables in D. The strata-specific
causal effect P*(y|do(x), z) is transportable
from Π to Π* if Z d-separates Y from S in the
X-manipulated version of D, that is, Z satisfies
$(Y \perp\!\!\!\perp S \mid Z, X)_{D_{\overline{X}}}$.

Proof.

$$P^*(y \mid do(x), z) = P(y \mid do(x), z, s^*).$$

From Rule 1 of do-calculus we have: P(y|do(x), z,
s*) = P(y|do(x), z) whenever Z satisfies $(Y \perp\!\!\!\perp S \mid Z, X)$
in $D_{\overline{X}}$. This proves Theorem 2. □

Definition 8 (S-admissibility). A set T of variables
satisfying $(Y \perp\!\!\!\perp S \mid T, X)$ in $D_{\overline{X}}$ will be called
S-admissible (with respect to the causal effect of
X on Y).

Corollary 1. The average causal effect P*(y|do(x))
is transportable from Π to Π* if there exists
a set Z of observed pretreatment covariates that
is S-admissible. Moreover, the transport formula is
given by the weighting of equation (3.1).

Example 7. The causal effect is transportable
in Figure 4(a), since Z is S-admissible, and in Figure
4(b), where the empty set is S-admissible. It
is also transportable by the same criterion in Figure
5(b), where W is S-admissible, but not in Figure
5(a) where no S-admissible set exists.
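The S-admissibility test of Definition 8 is a d-separation query on the diagram with all arrows into X removed, so it can be checked mechanically. The sketch below implements d-separation in plain Python through the standard ancestral moral graph construction. The function names, the edge-list encoding and the example diagrams are assumptions made for illustration; in particular, a bidirected arc is encoded here by an explicit latent parent `U`, and the first diagram is a hypothetical graph with a selection node on a pretreatment covariate, not a verified copy of any figure.

```python
from itertools import combinations


def d_separated(edges, xs, ys, zs):
    """True if every path between xs and ys is blocked by zs in the DAG `edges`."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
        parents.setdefault(a, set())
    # Step 1: keep only the ancestors of xs, ys and zs (including themselves).
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        n = stack.pop()
        if n not in relevant:
            relevant.add(n)
            stack.extend(parents.get(n, ()))
    # Step 2: moralize, linking each child to its parents and co-parents together.
    nbrs = {n: set() for n in relevant}
    for child in relevant:
        ps = parents.get(child, set())
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for p, q in combinations(ps, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # Step 3: remove zs and test undirected reachability from xs to ys.
    seen, stack = set(), [n for n in xs if n not in zs]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(m for m in nbrs[n] if m not in zs and m not in seen)
    return not (seen & ys)


def is_s_admissible(edges, t, x, y, s_nodes):
    """Check (Y _||_ S | T, X) in D with all arrows pointing into X removed."""
    x = set(x)
    mutilated = [(a, b) for a, b in edges if b not in x]
    return d_separated(mutilated, s_nodes, y, set(t) | x)


# Hypothetical diagram: S -> Z, Z -> X, Z -> Y, X -> Y, latent U -> X and U -> Y.
D1 = [("S", "Z"), ("Z", "X"), ("Z", "Y"), ("X", "Y"), ("U", "X"), ("U", "Y")]
z_admissible = is_s_admissible(D1, {"Z"}, {"X"}, {"Y"}, {"S"})
empty_admissible = is_s_admissible(D1, set(), {"X"}, {"Y"}, {"S"})

# The diagram of Example 4, X -> Y <- S: no set can separate S from Y.
D2 = [("X", "Y"), ("S", "Y")]
example4_admissible = is_s_admissible(D2, set(), {"X"}, {"Y"}, {"S"})
```

In the first diagram Z is S-admissible while the empty set is not, which is the pattern that calls for reweighting by the target distribution of Z. In the diagram of Example 4 the test fails even though the effect is trivially transportable, a reminder that S-admissibility is a sufficient route to transport, not the only one.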

Corollary 2. Any S variable that is pointing
directly into X as in Figure 6(a), or that is d-separated
from Y in $D_{\overline{X}}$, can be ignored.

This follows from the fact that the empty set is
S-admissible relative to any such S variable. Conceptually,
the corollary reflects the understanding
that differences in propensity to receive treatment
do not hinder the transportability of treatment effects;
the randomization used in the experimental
study washes away such differences.

Fig. 5. Selection diagrams illustrating S-admissibility. (a)
has no S-admissible set while in (b), W is S-admissible.

Fig. 6. Selection diagrams illustrating transportability. The causal effect P(y|do(x)) is (trivially) transportable in (c) but
not in (b) and (f). It is transportable in (a), (d) and (e) (see Corollary 2).

We now generalize Theorem 2 to cases involving 
treatment-dependent Z variables, as in Figure 4(c). 

Theorem 3. The average causal effect P*(y|do(x))
is transportable from Π to Π* if any one
of the following conditions holds:

1. P*(y|do(x)) is trivially transportable.

2. There exists a set of covariates, Z (possibly affected
by X) such that Z is S-admissible and for
which P*(z|do(x)) is transportable.

3. There exists a set of covariates, W that satisfy
$(X \perp\!\!\!\perp Y \mid W)_{D_{\overline{X(W)}}}$ and for which P*(w|do(x)) is
transportable.

Proof. 1. Condition 1 entails transportability.

2. If condition 2 holds, it implies

\begin{align}
P^*(y \mid do(x)) &= P(y \mid do(x), s) \tag{5.1} \\
&= \sum_z P(y \mid do(x), z, s)\, P(z \mid do(x), s) \tag{5.2} \\
&= \sum_z P(y \mid do(x), z)\, P^*(z \mid do(x)). \tag{5.3}
\end{align}

We now note that the transportability of P*(z|do(x))
should reduce P(z|do(x), s) to a star-free expression
and would render P*(y|do(x)) transportable.

3. If condition 3 holds, it implies

\begin{align}
P^*(y \mid do(x)) &= P(y \mid do(x), s) \tag{5.4} \\
&= \sum_w P(y \mid do(x), w, s)\, P(w \mid do(x), s) \tag{5.5} \\
&= \sum_w P(y \mid w, s)\, P^*(w \mid do(x)) \quad \text{(by Rule 3 of do-calculus)} \tag{5.6} \\
&= \sum_w P^*(y \mid w)\, P^*(w \mid do(x)). \tag{5.7}
\end{align}

We similarly note that the transportability of P*(w|do(x))
should reduce P(w|do(x), s) to a star-free
expression and would render P*(y|do(x)) transportable.
This proves Theorem 3. □

Example 8. To illustrate the application of
Theorem 3, let us apply it to Figure 4(c), which
corresponds to the surrogate endpoint problem discussed
in Section 3 (Example 3). Our goal is to
estimate P*(y|do(x)), the effect of X on Y in
the new population created by changes in how
Z responds to X. The structure of the problem
permits us to satisfy condition 2 of Theorem 3,
since Z is S-admissible and P*(z|do(x)) is trivially
transportable. The former can be seen from
$(S \perp\!\!\!\perp Y \mid X, Z)_{G_{\overline{X}}}$, hence P*(y|do(x), z) = P(y|do(x), z);
the latter can be seen from the fact that X and
Z are unconfounded, hence P*(z|do(x)) = P*(z|x).
Putting the two together, we get

$$P^*(y \mid do(x)) = \sum_z P(y \mid do(x), z)\, P^*(z \mid x), \tag{5.8}$$

which proves equation (3.4).
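Equation (5.8) can be checked by exact enumeration in a small discrete model with the structure described for the surrogate endpoint problem: X affects Y directly and through Z, a latent U confounds X and Y, and the two populations differ only in how Z responds to X. All parameter values and function names below are hypothetical, chosen only so that the check is easy to follow.

```python
from itertools import product

P_U = {0: 0.6, 1: 0.4}                         # latent confounder of X and Y
P_X1_GIVEN_U = {0: 0.3, 1: 0.7}                # P(x=1 | u), same in both populations
P_Z1_GIVEN_X = {"source": {0: 0.2, 1: 0.6},    # P(z=1 | x) in the study population
                "target": {0: 0.5, 1: 0.9}}    # P*(z=1 | x) in the target population
P_Y1 = {(x, z, u): 0.1 + 0.2 * x + 0.3 * z + 0.25 * u
        for x, z, u in product((0, 1), repeat=3)}   # P(y=1 | x, z, u), shared


def bern(p1, value):
    """Probability of `value` under a Bernoulli variable with P(1) = p1."""
    return p1 if value == 1 else 1.0 - p1


def joint(population, do_x=None):
    """Joint over (u, x, z, y); if do_x is given, X is set by intervention."""
    table = {}
    for u, x, z, y in product((0, 1), repeat=4):
        if do_x is None:
            px = bern(P_X1_GIVEN_U[u], x)
        else:
            px = 1.0 if x == do_x else 0.0
        table[(u, x, z, y)] = (P_U[u] * px
                               * bern(P_Z1_GIVEN_X[population][x], z)
                               * bern(P_Y1[(x, z, u)], y))
    return table


def prob(table, **fixed):
    """Marginal probability of the event described by keyword arguments."""
    index = {"u": 0, "x": 1, "z": 2, "y": 3}
    return sum(p for key, p in table.items()
               if all(key[index[name]] == v for name, v in fixed.items()))


def transported_effect(x):
    """Right-hand side of (5.8): source experiment plus target observations."""
    experiment = joint("source", do_x=x)
    observation = joint("target")
    total = 0.0
    for z in (0, 1):
        p_y_do_x_z = prob(experiment, z=z, y=1) / prob(experiment, z=z)
        p_star_z_x = prob(observation, x=x, z=z) / prob(observation, x=x)
        total += p_y_do_x_z * p_star_z_x
    return total


def true_target_effect(x):
    """P*(y=1 | do(x)), computed directly in the target model."""
    return prob(joint("target", do_x=x), y=1)


def untransported_effect(x):
    """P(y=1 | do(x)) in the study population, with no recalibration."""
    return prob(joint("source", do_x=x), y=1)
```

In this model `transported_effect(x)` agrees with `true_target_effect(x)` for both values of x, whereas `untransported_effect(x)` does not, since the study population has a different P(z|x).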

Remark. The test entailed by Theorem 3 is re¬ 
cursive, since the transportability of one causal ef¬ 
fect depends on that of another. However, given 
that the diagram is finite and acyclic, the sets Z 
and W needed in conditions 2 and 3 of Theorem 3 
would become closer and closer to X, and the iterative
process will terminate after a finite number
of steps. This occurs because the causal effect
P*(z|do(x)) (likewise, P*(w|do(x))) is trivially
transportable and equals P*(z) for any Z node
that is not a descendant of X. Thus, the need for reiteration
applies only to those members of Z that lie
on the causal pathways from X to Y. Note further
that the analyst need not terminate the procedure
upon satisfying the conditions of Theorem 3. If one
wishes to reduce the number of experiments, the procedure
can continue until no further reduction is feasible.

Example 9. Figure 6(d) requires that we invoke
both conditions 2 and 3 of Theorem 3, iteratively. To satisfy
condition 2, we note that Z is S-admissible, and we
need to prove the transportability of P*(z|do(x)).
To do that, we invoke condition 3 and note that W
d-separates X from Z in D. There remains to confirm
the transportability of P*(w|do(x)), but this
is guaranteed by the fact that the empty set is
S-admissible relative to W, since $(W \perp\!\!\!\perp S)$. Hence,
by Theorem 2 (replacing Y with W) P*(w|do(x))
is transportable, which bestows transportability on
P*(y|do(x)). Thus, the final transport formula (derived
formally in the Appendix) is:

$$P^*(y \mid do(x)) = \sum_z P(y \mid do(x), z) \sum_w P(w \mid do(x))\, P^*(z \mid w). \tag{5.9}$$

The first two factors of the expression are estimable
in the experimental study, and the third through observational
studies on the target population. Note
that the joint effect P(y,w,z|do(x)) need not be estimated
in the experiment; the decomposition
results in a decrease of measurement cost and sampling
variability.

A similar analysis proves the transportability of 
the causal effect in Figure 6(e) (see Pearl and 
Bareinboim (2011)). The model of Figure 6(f), however,
does not allow for the transportability of
P*(y|do(x)) as witnessed by the absence of an S-admissible
set in the diagram, and the inapplicability
of condition 3 of Theorem 3.

Example 10. To illustrate the power of Theorem 3
in discerning transportability and deriving
transport formulae, Figure 7 represents a more intricate
selection diagram, which requires several iterations
to discern transportability. The transport formula
for this diagram is given by (derived formally
in the Appendix):

$$P^*(y \mid do(x)) = \sum_z P(y \mid do(x), z) \sum_w P^*(z \mid w) \sum_t P(w \mid do(x), t)\, P^*(t). \tag{5.10}$$

The main power of this formula is to guide investigators
in deciding what measurements need be
taken in both the experimental study and the target
population. It asserts, for example, that variables U
and V need not be measured. It likewise asserts that
the W-specific causal effects need not be estimated
in the experimental study and only the conditional
probabilities P*(z|w) and P*(t) need be estimated
in the target population. The derivation of this formula
is given in the Appendix.

Fig. 7. Selection diagram in which the causal effect is shown
to be transportable in multiple iterations of Theorem 3 (see the
Appendix).

Despite its power, Theorem 3 is not complete,
namely, it is not guaranteed to approve all trans¬ 
portable relations or to disapprove all nontrans¬ 
portable ones. An example of the former is contrived 
in Bareinboim and Pearl (2012), where an alter¬ 
native, necessary and sufficient condition is estab¬ 
lished in both graphical and algorithmic form. The¬ 
orem 3 provides, nevertheless, a simple and powerful 
method of establishing transportability in practice. 

6. CONCLUSIONS 

Given judgements of how target populations may 
differ from those under study, the paper offers a for¬ 
mal representational language for making these as¬ 
sessments precise and for deciding whether causal 
relations in the target population can be inferred 
from those obtained in an experimental study. When 
such inference is possible, the criteria provided by 
Theorems 2 and 3 yield transport formulae, namely, 
principled ways of calibrating the transported relations
so as to properly account for differences in
the populations. These transport formulae enable 
the investigator to select the essential measurements 
in both the experimental and observational studies, 
and thus minimize measurement costs and sample 
variability. 

The inferences licensed by Theorems 2 and 3 represent
a worst-case analysis, since we have assumed, in
the tradition of nonparametric modeling, that every 
variable may potentially be an effect-modifier (or 
moderator). If one is willing to assume that certain 
relationships are noninteractive, or monotonic as is 
the case in additive models, then additional trans¬ 
port licenses may be issued, beyond those sanctioned 
by Theorems 2 and 3. 

While the results of this paper concern the transfer
of causal information from experimental to observational
studies, the method can also be of benefit
in transporting statistical findings from one observational
study to another (Pearl and Bareinboim
(2011)). The rationale for such transfer is two-fold. 
First, information from the first study may enable 
researchers to avoid repeated measurement of cer¬ 
tain variables in the target population. Second, by 
pooling data from both populations, we increase
the precision with which their commonalities are estimated
and, indirectly, also increase the precision with
which the target relationship is transported. Substantial
reduction in sampling variability can be
thus achieved through this decomposition (Pearl 
(2012b)). 

Clearly, the same data-sharing philosophy can be 
used to guide Meta-Analysis (Glass (1976); Hedges 
and Olkin (1985); Rosenthal (1995); Owen (2009)), 
where one attempts to combine results from many 
experimental and observational studies, each con¬ 
ducted on a different population and under a differ¬ 
ent set of conditions, so as to construct an aggregate 
measure of effect size that is “better,” in some for¬ 
mal sense, than any one study in isolation. While 
traditional approaches aim to average out differences
between studies, our theory exploits the commonalities
among the populations studied and the
target population. By pooling together commonali¬ 
ties and discarding areas of disparity, we gain max¬ 
imum use of the available samples (Bareinboim and 
Pearl (2013c)). 

To be of immediate use, our method relies on the 
assumption that the analyst is in possession of suf¬ 
ficient background knowledge to determine, at least 
qualitatively, where two populations may differ from 


one another. This knowledge is not vastly different 
from that required in any principled approach to 
causation in observational studies, since judgement 
about possible effects of omitted factors is crucial 
in any such analysis. Whereas such knowledge may 
only be partially available, the analysis presented in 
this paper is nevertheless essential for understand¬ 
ing what knowledge is needed for the task to succeed 
and how sensitive conclusions are to knowledge that 
we do not possess. 

Real-life situations will be marred, of course, by
additional complications that were not addressed 
directly in this paper; for example, measurement 
errors, selection bias, finite sample variability, un¬ 
certainty about the graph structure and the pos¬ 
sible existence of unmeasured confounders between 
any two nodes in the diagram. Such issues are not 
unique to transportability; they plague any problem 
in causal analysis, regardless of whether they are 
represented formally or ignored by avoiding formal¬ 
ism. The methods offered in this paper are represen¬ 
tative of what theory permits us to do in ideal situa¬ 
tions, and the graphical representation presented in 
this paper makes the assumptions explicit and trans¬ 
parent. Transparency is essential for reaching tenta¬ 
tive consensus among researchers and for facilitating 
discussions to distinguish that which is deemed plau¬ 
sible and important from that which is negligible or 
implausible. 

Finally, it is important to mention two recent 
extensions of the results reported in this article. 
Bareinboim and Pearl (2013a) have addressed the 
problem of transportability in cases where only 
a limited set of experiments can be conducted 
at the source environment. Subsequently, the re¬ 
sults were generalized to the problem of “meta¬ 
transportability,” that is, pooling experimental re¬ 
sults from multiple and disparate sources to syn¬ 
thesize a consistent estimate of a causal relation at 
yet another environment, potentially different from 
each of the former (Bareinboim and Pearl (2013c)). 
It is shown that such synthesis may be feasible from 
multiple sources even in cases where it is not feasible 
from any one source in isolation. 

APPENDIX 

Derivation of the transport formula for the causal
effect in the model of Figure 6(d) [equation (5.9)]:

\begin{align*}
P^*(y \mid do(x)) &= P(y \mid do(x), s) \\
&= \sum_z P(y \mid do(x), s, z)\, P(z \mid do(x), s) \\
&= \sum_z P(y \mid do(x), z)\, P(z \mid do(x), s) \\
&\qquad \text{(2nd condition of Theorem 3, $S$-admissibility of $Z$ of $CE(X, Y)$)} \\
&= \sum_z P(y \mid do(x), z) \sum_w P(z \mid do(x), w, s)\, P(w \mid do(x), s) \\
&= \sum_z P(y \mid do(x), z) \sum_w P(z \mid w, s)\, P(w \mid do(x), s) \tag{A.1} \\
&\qquad \text{(3rd condition of Theorem 3, $(X \perp\!\!\!\perp Z \mid W, S)_{D_{\overline{X(W)}}}$)} \\
&= \sum_z P(y \mid do(x), z) \sum_w P(z \mid w, s)\, P(w \mid do(x)) \\
&\qquad \text{(2nd condition of Theorem 3, $S$-admissibility of the empty set $\{\}$ of $CE(X, W)$)} \\
&= \sum_z P(y \mid do(x), z) \sum_w P^*(z \mid w)\, P(w \mid do(x)).
\end{align*}

Derivation of the transport formula for the causal
effect in the model of Figure 7 [equation (5.10)]:

\begin{align*}
P^*(y \mid do(x)) &= P(y \mid do(x), s, s') \\
&= \sum_z P(y \mid do(x), s, s', z)\, P(z \mid do(x), s, s') \\
&= \sum_z P(y \mid do(x), z)\, P(z \mid do(x), s, s') \\
&\qquad \text{(2nd condition of Theorem 3, $S$-admissibility of $Z$ of $CE(X, Y)$)} \\
&= \sum_z P(y \mid do(x), z) \sum_w P(z \mid do(x), s, s', w)\, P(w \mid do(x), s, s') \\
&= \sum_z P(y \mid do(x), z) \sum_w P(z \mid s, s', w)\, P(w \mid do(x), s, s') \\
&\qquad \text{(3rd condition of Theorem 3, $(X \perp\!\!\!\perp Z \mid W, S, S')_{D_{\overline{X(W)}}}$)} \\
&= \sum_z P(y \mid do(x), z) \sum_w P(z \mid s, s', w) \sum_t P(w \mid do(x), s, s', t)\, P(t \mid do(x), s, s') \tag{A.2} \\
&= \sum_z P(y \mid do(x), z) \sum_w P(z \mid s, s', w) \sum_t P(w \mid do(x), t)\, P(t \mid do(x), s, s') \\
&\qquad \text{(2nd condition of Theorem 3, $S$-admissibility of $T$ on $CE(X, W)$)} \\
&= \sum_z P(y \mid do(x), z) \sum_w P(z \mid s, s', w) \sum_t P(w \mid do(x), t)\, P(t \mid s, s') \\
&\qquad \text{(1st condition of Theorem 3 / Rule 3 of do-calculus, $(X \perp\!\!\!\perp T \mid S, S')_{D_{\overline{X}}}$)} \\
&= \sum_z P(y \mid do(x), z) \sum_w P^*(z \mid w) \sum_t P(w \mid do(x), t)\, P^*(t).
\end{align*}